Versioning Example (Part 3/3)

In part 2, we trained and logged a problematic model, and then reverted the commit to restore a good version.

Now we'll train an even better model—one that can also classify tweets in German—but this time we'll work on a separate branch and merge it in, instead of committing directly to master.

This workflow requires verta>=0.14.1 and spaCy>=2.0.0.
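
If either package is missing, a standard pip install along these lines should cover it (the version pins simply mirror the requirement above):

!pip install "verta>=0.14.1" "spacy>=2.0.0"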


Setup

Instead of spaCy's English model, we'll build on its multilingual model.


In [1]:
!python -m spacy download xx_ent_wiki_sm


Requirement already satisfied: xx_ent_wiki_sm==2.2.0 from https://github.com/explosion/spacy-models/releases/download/xx_ent_wiki_sm-2.2.0/xx_ent_wiki_sm-2.2.0.tar.gz#egg=xx_ent_wiki_sm==2.2.0 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (2.2.0)
Requirement already satisfied: spacy>=2.2.0 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from xx_ent_wiki_sm==2.2.0) (2.2.4)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (1.0.2)
Requirement already satisfied: setuptools in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (41.2.0)
Requirement already satisfied: thinc==7.4.0 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (7.4.0)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (3.0.2)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (1.0.2)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (4.43.0)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (2.0.3)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (1.0.0)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (0.6.0)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (2.23.0)
Requirement already satisfied: numpy>=1.15.0 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (1.18.1)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (0.4.1)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (1.1.3)
Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from catalogue<1.1.0,>=0.0.7->spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (1.5.0)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (1.25.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (2.9)
Requirement already satisfied: certifi>=2017.4.17 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (2019.11.28)
Requirement already satisfied: zipp>=0.5 in /Users/miliu/Documents/modeldb/client/workflows/venv-flow/lib/python3.7/site-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy>=2.2.0->xx_ent_wiki_sm==2.2.0) (3.1.0)
WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the model via spacy.load('xx_ent_wiki_sm')

Then, as before, we'll import the libraries we need...


In [2]:
from __future__ import unicode_literals, print_function

import boto3
import json
import numpy as np
import pandas as pd
import spacy

...and instantiate Verta's ModelDB Client.


In [3]:
from verta import Client

client = Client('https://app.verta.ai')
proj = client.set_project('Tweet Classification')
expt = client.set_experiment('SpaCy')


set email from environment
set developer key from environment
connection successfully established
set existing Project: Tweet Classification from personal workspace
set existing Experiment: SpaCy
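
As the log output above shows, the Client picks up credentials from the environment. If yours aren't set, you can set them before instantiating the Client; a minimal sketch, assuming the default VERTA_EMAIL and VERTA_DEV_KEY variable names that the Client looks for:

import os

os.environ['VERTA_EMAIL'] = 'you@example.com'         # your Verta account email
os.environ['VERTA_DEV_KEY'] = '<your developer key>'  # your personal developer key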

Prepare Data

Again, things are a little different.

Our multilingual model needs German training data to classify German tweets, so we'll download two datasets from S3.

Before, we trained the model on just english-tweets.csv. Now, we're also going to train on german-tweets.csv.


In [4]:
S3_BUCKET = "verta-starter"
EN_S3_KEY = "english-tweets.csv"
EN_FILENAME = EN_S3_KEY
DE_S3_KEY = "german-tweets.csv"
DE_FILENAME = DE_S3_KEY

boto3.client('s3').download_file(S3_BUCKET, EN_S3_KEY, EN_FILENAME)
boto3.client('s3').download_file(S3_BUCKET, DE_S3_KEY, DE_FILENAME)

In [5]:
import utils

en_data = pd.read_csv(EN_FILENAME)
de_data = pd.read_csv(DE_FILENAME)

data = pd.concat([en_data, de_data], axis=0)
data = data.sample(frac=1).reset_index(drop=True)
utils.clean_data(data)

data.head()


Out[5]:
   text                                                 sentiment
0  This is so weird. Being back here makes me mis...           0
1  Me too I couldn't button my jeans today....                 0
2  i can't believe janice got voted off lameeeeeeee            0
3  it's not summer yet!                                        0
4  You are welcome!                                            1

Capture and Version Model Ingredients

As before, we'll capture and log our model ingredients. Note that now we're logging both of our datasets from S3.


In [6]:
from verta.code import Notebook
from verta.configuration import Hyperparameters
from verta.dataset import S3
from verta.environment import Python

code_ver = Notebook()  # Notebook & git environment
config_ver = Hyperparameters({'n_iter': 20})
dataset_ver = S3([
    "s3://{}/{}".format(S3_BUCKET, EN_S3_KEY),
    "s3://{}/{}".format(S3_BUCKET, DE_S3_KEY),
])
env_ver = Python()  # pip environment and Python version


But instead of committing directly to master, we'll check out and commit to a separate branch.


In [7]:
repo = client.set_repository('Tweet Classification')
commit = repo.get_commit(branch='master').new_branch('multilingual')


set existing Repository: Tweet Classification from personal workspace

In [8]:
commit.update("notebooks/tweet-analysis", code_ver)
commit.update("config/hyperparams", config_ver)
commit.update("data/tweets", dataset_ver)
commit.update("env/python", env_ver)

commit.save("Support German tweets")

commit


Out[8]:
(Branch: multilingual)
Commit 92bf1a8d4c7e5c4dc8f00fcea5748e90023791f50f37c261fb977e630daba6ca containing:
config/hyperparams (Hyperparameters)
data/tweets (S3)
env/python (Python)
notebooks/tweet-analysis (Notebook)

You may verify through the Web App that this commit—on branch multilingual—updates the dataset, as well as the Notebook.
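
You can also spot-check the commit's contents from the Client; a quick sketch, assuming the Commit.get() accessor for reading a blob back by path:

blob = commit.get("data/tweets")  # the S3 dataset blob we just versioned
print(blob)  # should reference both english-tweets.csv and german-tweets.csv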


Train and Log Model

As before, we'll train the model and log it, along with the commit, to an Experiment Run.


In [9]:
nlp = spacy.load('xx_ent_wiki_sm')

In [10]:
import training

training.train(nlp, data, n_iter=20)


Using 16000 examples (12800 training, 3200 evaluation)
Training the model...
LOSS 	  P  	  R  	  F  
16.027	0.752	0.764	0.758
0.361	0.775	0.744	0.759
0.104	0.791	0.737	0.763
0.090	0.790	0.729	0.758
0.079	0.783	0.732	0.757
0.067	0.787	0.726	0.756
0.058	0.778	0.723	0.749
0.047	0.777	0.718	0.746
0.042	0.777	0.726	0.751
0.034	0.769	0.725	0.747
0.030	0.766	0.729	0.747
0.026	0.766	0.728	0.746
0.024	0.765	0.729	0.747
0.022	0.765	0.729	0.746
0.020	0.765	0.720	0.742
0.019	0.766	0.718	0.741
0.018	0.765	0.715	0.739
0.017	0.759	0.715	0.736
0.015	0.762	0.718	0.739
0.015	0.764	0.721	0.742
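
The training module lives alongside this notebook, so its exact loop isn't shown here. For orientation, the sketch below approximates the kind of standard spaCy 2.x text-categorization loop such a module typically wraps; the function name, label, batch sizes, and data format (a list of (text, annotations) pairs) are assumptions, and the evaluation that produces the P/R/F columns above is omitted:

import random

from spacy.util import minibatch, compounding

def train_sketch(nlp, train_data, n_iter=20):
    # add a text categorizer to the pipeline if one isn't already present
    if 'textcat' not in nlp.pipe_names:
        textcat = nlp.create_pipe('textcat')
        nlp.add_pipe(textcat, last=True)
    else:
        textcat = nlp.get_pipe('textcat')
    textcat.add_label('POSITIVE')  # assumed label

    # update only the textcat component, leaving the rest of the pipeline frozen
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for _ in range(n_iter):
            losses = {}
            random.shuffle(train_data)
            for batch in minibatch(train_data, size=compounding(4.0, 32.0, 1.001)):
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            print(losses['textcat'])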

In [11]:
run = client.set_experiment_run()

run.log_model(nlp)


created new ExperimentRun: Run 4342615846618541268559
upload complete (custom_modules.zip)
upload complete (model.pkl)
upload complete (model_api.json)

In [12]:
run.log_commit(
    commit,
    {
        'notebook': "notebooks/tweet-analysis",
        'hyperparameters': "config/hyperparams",
        'training_data': "data/tweets",
        'python_env': "env/python",
    },
)
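
This links the run to the exact commit and records which key path holds each ingredient. The association can be read back later; a brief sketch, assuming ExperimentRun.get_commit() returns the commit together with the key-path mapping:

retrieved_commit, key_paths = run.get_commit()
print(key_paths['training_data'])  # data/tweets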

Merge Commit

Our model seems to be handling our multilingual data just fine, so we'll merge our improvements into master.


In [13]:
commit


Out[13]:
(Branch: multilingual)
Commit 92bf1a8d4c7e5c4dc8f00fcea5748e90023791f50f37c261fb977e630daba6ca containing:
config/hyperparams (Hyperparameters)
data/tweets (S3)
env/python (Python)
notebooks/tweet-analysis (Notebook)

In [14]:
master = repo.get_commit(branch="master")

master


Out[14]:
(Branch: master)
Commit e9f25d8206115119d202c62f540a60e6d988615e6c96e9c0701b67b8b5c2c9f9 containing:
config/hyperparams (Hyperparameters)
data/tweets (S3)
env/python (Python)
notebooks/tweet-analysis (Notebook)

In [15]:
master.merge(commit)

master


Out[15]:
(Branch: master)
Commit 99774bcc3b84d420340c02346130713627f900dca165580210d2fd6d8094fb73 containing:
config/hyperparams (Hyperparameters)
data/tweets (S3)
env/python (Python)
notebooks/tweet-analysis (Notebook)

Now we've merged multilingual into master, bringing in our verified changes.

Again, the Web App will show this merge commit on master updating the dataset and the Notebook.
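
To confirm the same thing from the Client, you can walk master's history; a sketch, assuming the Commit.log() iterator over ancestor commits (newest first):

for ancestor in master.log():
    print(ancestor)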